This is the introduction to my final assignment for the open data science I will be working on the alcohol data set
This project looks into the factors affecting academic performances of students, absences and also alcohol consumption pattern. Multiple Correspondence Analysis, regression models(GLM, GAM and GBM), and Linear Discriminant analysis were used to explore the data. The result shows that female students are more successful in their studies than male students because of several factors. They get more supports for studies, are more ambitious to further to higher education, and less absent, have less high alcohol consumption, than thier male counterparts.
Mothers’ education levels, to a considerable extent, also affect students’ performances and based on the MCA, female students are closer to their mother, which might be another underlying factor for their better success ahead of male students in their studies.
Different measures were adopted to validate the models. Of all the techniques adopted, the regression models seemed to be the weakest and might require more observations and/or predictors. The prediction for the binomial variable of high alcohol consumption was however, fairly good, after being evaluated by Receiver operating characteristic(ROC) Area Under Curve(AUC), with average value of about 0.7 for the various models. The precision(0.75) and accuracy(0.79) were quite good too but performed better when predicting FALSE.
By and large, the result shows that female students perform better and describes the factors which play roles in their success.
#clear environment
rm(list=ls())
#read data
fpath<- "C:/Users/oyeda/Desktop/OPEN_DATA_SCIENCE/IODS-final/data/"
alco_data <- read.table(paste0(fpath, "alcohol_student.csv"), sep=",", header=T)
alco_data<- alco_data[,-1]
#necessary packages
#install.packages("dismo")
library(dplyr)
library(ggplot2)
library(corrplot)
library(GGally)
library(tidyr)
library(tidyverse)
library(gbm)
library(dismo)
library(caTools)
library(mgcv)
library(MASS)
library(FactoMineR)
Attribute Information:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 â???" 5th to 9th grade, 3 â???" secondary education or 4 â???" higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 â???" 5th to 9th grade, 3 â???" secondary education or 4 â???" higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 high_use - high alcohol consumption(boolean: TRUE or FALSE)
32 alc_use - alcohol consumption(numeric:average of Dalc & Walc)
these grades are related with the course subject, Math or Portuguese:
33 G1 - first period grade (numeric: from 0 to 20)
34 G2 - second period grade (numeric: from 0 to 20)
35 G3 - final grade (numeric: from 0 to 20, output target)
My aims are to demonstrate the use of various statistical techniques in getting insight in to the dataset, to understand the patterns of alcohol consumption consumptions and absences amongst students in two schools. Furthermore, I seek to explore the distribution of grades amongst male and female students; and factors that might be affecting this. Firstly, I will use the basic descriptive statitstics to understand the distribution, correlation and dependencies(corrplot, barplot, histogram, biplot etc) I will also be using the regression models(Generalized Linear Model(GLM), Generalised Additive Model(GAM) and Generalised Boosted Model(GBM)).
Also, I will test my regressions models using 70:30 data split, and boostrapping with replacement sampling.
Lastly. I will adopt Linear Discriminant Analysis(LDA) and Multiple Correspondence Analysis(MCA), to classify and categorise the data.
What are the factors considerably affecting students’ grades?
What are the factors causing absences amongst students?
What are the factors significantly causing high alcohol consumption amongst students.
High alcohol consumption has no effect on students’ performances.
students’ grades are not affected by absences.
Mothers’ level of education does not affect their kids’ academic performances.
Male and Female students get same level of support.
The data includes 35 variables and 382 observations which describe the performances of secondary school students in two distinct subjects: Mathematics and Portuguese. It also contains demographic, social and other information about schooling. Other information related to the original data can be found from this source.
After downloading the data, it was wrangled by combining the two different datasets related to Performances in Mathematics and Portuguese. Thereafter, Weekend and weekdays alcohol consumptions were combined by finding their average. Afterwards, a threshold of 2 was selected for high alcohol consumption and a boolean variable(“high_use”) was created from that. The link to my DATA WRANGLING is here.
The wrangled data can also be found here.
The response/target variables of interest are: grades(G3), high alcohol consumption(high_use), alcohol consumption(alc_use) and absences. G3 is the final grades of the year of the students. Other information related to the description of the variables can be found from the source given above.
All in all, there are 35 variables and 382 observations in the dataset.
categ = apply(alco_data, 2, function(x) nlevels(as.factor(x)))
#categ
dim(alco_data)
## [1] 382 35
str(alco_data)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
## $ reason : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
## $ famsup : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ paid : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
summary(alco_data)
## school sex age address famsize Pstatus
## GP:342 F:198 Min. :15.00 R: 81 GT3:278 A: 38
## MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104 T:344
## Median :17.00
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## Medu Fedu Mjob Fjob
## Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.448
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.037 Mean :0.2016
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.000 Min. :1.00 Min. :1.000
## yes:364 yes:121 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000
## Mean :3.937 Mean :3.22 Mean :3.113
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 1.0
## Median :1.000 Median :2.000 Median :4.000 Median : 3.0
## Mean :1.482 Mean :2.296 Mean :3.573 Mean : 4.5
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.0
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :45.0
## G1 G2 G3 alc_use
## Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:268
## TRUE :114
##
##
##
#glimpse(alco_data)
# ####Grade:
#bar plot of the variables
gather(alco_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
The above show the distributions of the chosen predictors and also scale of alcohol consumption. The ages of the students are mainly within 15-19 with few students aged 20 and 22. Generally, there are relatively few chronic alcohol consumers amongst the students. The students seem to have quite enough free time. The respondents include 198 female students and 184 male students.
hyp<- alco_data[,c("age", "sex", "absences","freetime","alc_use")]
#show the overview of the predictors and response variable(non-binomial)
plot_hyp <- ggpairs(hyp, mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
plot_hyp #draw the plot
Here, we can see that the female students seem to be generally quite older and there are both male and female students that are much older than the general sample of the students. The absences are not so high but some chronic absentees amongst the students(both male and female). Most of the students are quite free but a few of the students are very busy. Largely, all the predictors show no correlation, hence, the issue of multicolinearity is not a problem here. However, these predictors still show very weak correlation with alcohol consumption. Nevertheless, it should be recalled that correlation is not causation. Hence, I will dig further to understand the effects of these predictors on alcohol consumption amongst students and see if they are significant.
Crosstabs The below crosstabs compare the predictors with alcohol consumption
#using the binomial response(high_use)
#it is also possible to use simple crosstabs e.g
#xtabs(high_use~age, data = alc)
#but below is a more comprehensive crosstab.
#Crosstabs
summarise(group_by(alco_data, age,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 12 x 4
## # Groups: age [?]
## age high_use count mean_grade
## <int> <lgl> <int> <dbl>
## 1 15 FALSE 63 12.079365
## 2 15 TRUE 18 11.388889
## 3 16 FALSE 79 12.126582
## 4 16 TRUE 28 11.250000
## 5 17 FALSE 64 11.578125
## 6 17 TRUE 36 11.305556
## 7 18 FALSE 54 11.370370
## 8 18 TRUE 27 9.777778
## 9 19 FALSE 7 7.857143
## 10 19 TRUE 4 8.750000
## 11 20 FALSE 1 16.000000
## 12 22 TRUE 1 6.000000
#an alternative and preferrable way of doing the above
#alc %>% group_by(age, high_use) %>% summarise(count=n(), mean_grade = mean(G3))
summarise(group_by(alco_data, sex,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 4 x 4
## # Groups: sex [?]
## sex high_use count mean_grade
## <fctr> <lgl> <int> <dbl>
## 1 F FALSE 156 11.39744
## 2 F TRUE 42 11.71429
## 3 M FALSE 112 12.20536
## 4 M TRUE 72 10.27778
summarise(group_by(alco_data, absences,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 43 x 4
## # Groups: absences [?]
## absences high_use count mean_grade
## <int> <lgl> <int> <dbl>
## 1 0 FALSE 52 12.13462
## 2 0 TRUE 13 11.23077
## 3 1 FALSE 38 11.78947
## 4 1 TRUE 13 10.92308
## 5 2 FALSE 42 11.64286
## 6 2 TRUE 16 10.93750
## 7 3 FALSE 33 11.75758
## 8 3 TRUE 8 11.50000
## 9 4 FALSE 24 12.08333
## 10 4 TRUE 12 10.25000
## # ... with 33 more rows
The age distribution by sex
# initialize a plot of 'age'
sex_age <- ggplot(data = alco_data, aes(x=age, col=sex))
# draw a bar plot of age by sex
sex_age + geom_bar() + facet_wrap("sex")
High alcohol consumption by sex
# initialize a plot of 'high_use'
hu_sex <- ggplot(data = alco_data, aes(x=sex, col=sex))
# draw a bar plot of high_use by sex
hu_sex + geom_bar() + facet_wrap("high_use")
Here, we can see that there are more female students who are not high consumers of alcohol compared to men. Men consume more alchols than their female counterparts
High use vs absences
# initialize a plot of 'high_use'
hu_ab <- ggplot(data = alco_data, aes(x=absences, col=sex))
# draw a bar plot of 'high_use' by freetime
hu_ab + geom_bar() + facet_wrap("high_use")
High use vs age
# initialize a plot of 'high_use'
hu_ag <- ggplot(data = alco_data, aes(x=age, col=sex))
# draw a bar plot of 'high_use' by freetime
hu_ag + geom_bar() + facet_wrap("high_use")
High use vs romantic
# initialize a plot of 'high_use'
hu_rom <- ggplot(data = alco_data, aes(x=romantic, col=sex))
# draw a bar plot of 'high_use' by freetime
hu_rom + geom_bar() + facet_wrap("high_use")
For students in romantic relationships, they generally have lower number of them that are high consumers of alcohol. This is same with those in no romantic relationship.
BOXPLOTS
Below are boxplots to give clearer pictures, as explained earlier. Relationship of alcohol consumption with absences
# initialise a plot of high_use and absences
h_ab<- ggplot(alco_data, aes(x=high_use, y=absences,col=sex))
# define the plot as a boxplot and draw it
h_ab + geom_boxplot() + ggtitle("Student absences vs alcohol consumption")
There are a few chronic absentees amongst male and female students, but generally, most of the students have low absences. It seems some students(both high and non-high alcohol consumers) are frequently absent. However, high consumers appear to be more absent, overall.
Relationship of alcohol consumption with age
# initialise a plot of high_use and age
h_ag<- ggplot(alco_data, aes(x=high_use, y=age,col=sex))
# define the plot as a boxplot and draw it
h_ag + geom_boxplot() + ggtitle("Students' age vs alcohol consumption")
Relationship of alcohol consumption with sex
# initialise a plot of high_use and sex
alc_sex<- ggplot(alco_data, aes(y=alc_use, x=sex,col=sex))
# define the plot as a boxplot and draw it
alc_sex + geom_boxplot() + ggtitle("Student's alcohol consumption by sex")
There are a few female students with quite high alcohol consumption but the alcohol consumption levels do not vary as much as they do amongst the male students. Generally, male students tend to consume more alcohol.
To understand factors that could affect students’ performances, alcohol consumption pattern and absences, I will be utilising three different approaches. The first is the Multiple Correspondence Analysis(MCA) which is used for categorical variables. I will firstly explore the already categorised variables and further categorise more continuous variables for MCA. This is useful to understand the associations amongst the variables.
The second approach involves the use of various regression models: Generalised Linear Model(GLM), Generalised Additive Model(GAM) and General Boosted Model(GBM). These will be used to further understand the causations. Furthermore, two GBM packages in R-“gbm” and “dismo”- will be used to compare their performances. In selection of the predictors for GLM, stepwise regression, t-test from model summary and anova(Chi square) test will be performed to eliminate redundancies. In GAM, variables will be selected in similar manner, and a smoothing spline function be used with appropriate degree of freedom.
Also, the models of the grades(G3) and absences, will be evaluated, by using correlation, Mean Absolute error and Root Mean Square Error(RMSE). However, the models of binomial response variable - high alcohol consumption(high_use) - will be evaluated by adopting the Receiver operating characteristic(ROC) Area Under Curve(AUC). Confusion/Classification matrix will also be used to evaluate the precision, accuracy, sensitivity and specificity. The odds ratio will also be looked into.
In the course of the evaluation of the regression models, the data will be divided into 70% calibration or training data and 30% evaluation or testing data. The data is also sampled by replacement and boostrapped ten times.
Lastly, Linear Discriminant Analysis will be performed on categorised grades(G3) and absences. it will also be evaluated by a classification matrix.
To avoid, confusion, I will copy the data into a new data frame for MCA. Firstly, I will be extracting the categorical variables, as factors.
#To avoid, confusion, I will copy the data into a new data frame for MCA
data_mca <- alco_data
#Firstly, I will be extracting the categorical variables, as factors
#now, filter them out for the MCA
data_mca_fac1<-Filter(is.factor, data_mca)
high_use<-factor(alco_data[,"high_use"])
data_mca_fac1 <- cbind.data.frame(data_mca_fac1, high_use)
MCA plot
#par(mfrow=c(1,2))
#now, perform the MCA on the catergorial/qualitative variables
mca_alco1 <- MCA(data_mca_fac1, graph = T)
#plot it
plot(mca_alco1, invisible=c("ind"), habillage="quali")
#par(mfrow=c(1,1))
BETTER VISUALISATION FOR MCA
# Getting a better biplot for MCA using ggplot
## number of categories per variable
categ = apply(data_mca_fac1, 2, function(x) nlevels(as.factor(x)))
categ
## school sex address famsize Pstatus Mjob
## 2 2 2 2 2 5
## Fjob reason nursery internet guardian schoolsup
## 5 4 2 2 3 2
## famsup paid activities higher romantic high_use
## 2 2 2 2 2 2
# data frame with variable coordinates
mca_vars_df = data.frame(mca_alco1$var$coord, Variable = rep(names(categ), categ))
#mca_vars_df
# data frame with observation coordinates
mca_obs_df = data.frame(mca_alco1$ind$coord)
# MCA plot of observations and categories
ggplot(data = mca_obs_df, aes(x = Dim.1, y = Dim.2)) +
geom_hline(yintercept = 0, colour = "gray70") +
geom_vline(xintercept = 0, colour = "gray70") +
geom_point(colour = "gray50", alpha = 0.0) +
geom_density2d(colour = "gray80") +
geom_text(data = mca_vars_df,
aes(x = Dim.1, y = Dim.2,
label = rownames(mca_vars_df), colour = Variable)) +
ggtitle("MCA plot of variables using R package FactoMineR") +
scale_colour_discrete(name = "Variable")
Here, we can see that female students generally attend school because of the reputation and get more school support. They also have family support and are able to attend paid extra classes. Most of their father’s job are in the health sector. High alcohol consumption is more rampant amongst male students compared to their female counterparts.
Male students also seem attend the school based on course preference and other reasons. By and large, they do not attend paid extra classes compared to the female students and also do not get family and school support like the frmale students. Lastly, unlike the female students, they mosly have no intention of pursuing higher education. This is explored further by categorising more continuous variables such as grades, absences, to allow for comparison with other variables.
CATEGORISE MORE VARIABLES
Next, I will be categorising grade(G3), absences, Mother’s education(Medu), father’s education
#Next, I will be categorising grade(G3), absences, Mother's education(Medu),
#father's education
data_mca2<- alco_data
#now, convert mother and father's education level into something more understandable.
#data_mca$Medu<- factor(data_mca$Medu)
data_mca2$Medu<- factor(data_mca2$Medu, labels=c("no", "pr","5-9th", "sec", "hiE"))
data_mca2$Fedu<- factor(data_mca2$Medu, labels=c("no", "pr","5-9th", "sec", "hiE"))
#Now, let's categorise grades according to quartiles
bins_abs<- quantile(data_mca2$absences)
data_mca2$absences<-cut(data_mca2$absences, breaks=bins_abs, include.lowest=T,
labels=c("vlow", "low", "high", "vhigh"))
#same to grade(G3)
bins_gra<- quantile(data_mca2$G3)
data_mca2$G3<-cut(data_mca$G3, breaks=bins_gra, include.lowest=T,
labels=c("vlow", "low", "high", "vhigh"))
#Getting the columns that are factors
#since MCA is for categorical variables, I will be filtering the categorical
#variables/factors and also categorise some of the variables of interest
#such as absences, grade(G3). I will also Categorise other variables of interest that are in integer forms.
#let's first see the variables already in factor format
names(Filter(is.factor, data_mca2))
## [1] "school" "sex" "address" "famsize" "Pstatus"
## [6] "Medu" "Fedu" "Mjob" "Fjob" "reason"
## [11] "nursery" "internet" "guardian" "schoolsup" "famsup"
## [16] "paid" "activities" "higher" "romantic" "absences"
## [21] "G3"
#now, filter them out for the MCA
data_mca_fac2_<-Filter(is.factor, data_mca2)
#include the high alcohol use column
high_use<-factor(alco_data[,"high_use"])
#join with the dataframe with categorical varianles
data_mca_fac2_<-cbind.data.frame(data_mca_fac2_, high_use)
str(data_mca_fac2_)
## 'data.frame': 382 obs. of 22 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ address : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
## $ Medu : Factor w/ 5 levels "no","pr","5-9th",..: 2 2 3 3 4 4 4 3 4 4 ...
## $ Fedu : Factor w/ 5 levels "no","pr","5-9th",..: 2 2 3 3 4 4 4 3 4 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
## $ reason : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
## $ famsup : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ paid : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
## $ absences : Factor w/ 4 levels "vlow","low","high",..: 2 2 4 2 3 2 1 1 4 4 ...
## $ G3 : Factor w/ 4 levels "vlow","low","high",..: 2 1 2 1 2 2 1 1 4 1 ...
## $ high_use : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 2 1 1 1 2 1 ...
#The above can also be done with dplyr which has been loaded already too
#data_mca %>% Filter(f = is.factor)
#names of the variables again ?
names(data_mca_fac2_)
## [1] "school" "sex" "address" "famsize" "Pstatus"
## [6] "Medu" "Fedu" "Mjob" "Fjob" "reason"
## [11] "nursery" "internet" "guardian" "schoolsup" "famsup"
## [16] "paid" "activities" "higher" "romantic" "absences"
## [21] "G3" "high_use"
#Alternative for finding the column name that are factors
# names(data_)[ sapply(data_reg, is.factor) ]
# #or
# data_reg %>% Filter(f = is.factor) %>% names
# or
# data_reg %>% summarise_each(funs(is.factor)) %>% unlist %>% .[.] %>% names
#I will select few of the variables for clarity sake and include the sex too
data_mca_fac2<-data_mca_fac2_[,c("sex",names(data_mca_fac2_[,(10:ncol(data_mca_fac2_))]))]
#now, perform the MCA on the catergorial/qualitative variables
mca_alco2 <- MCA(data_mca_fac2, graph = T)
#plot it
plot(mca_alco2, invisible=c("ind"), habillage="quali")
# Getting a better biplot for MCA using ggplot
## number of categories per variable
categ2 = apply(data_mca_fac2, 2, function(x) nlevels(as.factor(x)))
categ2
## sex reason nursery internet guardian schoolsup
## 2 4 2 2 3 2
## famsup paid activities higher romantic absences
## 2 2 2 2 2 4
## G3 high_use
## 4 2
# data frame with variable coordinates
mca_vars_df2 = data.frame(mca_alco2$var$coord, Variable = rep(names(categ2), categ2))
#mca_vars_df2
# data frame with observation coordinates
mca_obs_df2 = data.frame(mca_alco2$ind$coord)
# MCA plot of observations and categories
ggplot(data = mca_obs_df2, aes(x = Dim.1, y = Dim.2)) +
geom_hline(yintercept = 0, colour = "gray70") +
geom_vline(xintercept = 0, colour = "gray70") +
geom_point(colour = "gray50", alpha = 0.1) +
geom_density2d(colour = "gray80") +
geom_text(data = mca_vars_df2,
aes(x = Dim.1, y = Dim.2,
label = rownames(mca_vars_df2), colour = Variable)) +
ggtitle("MCA plot of variables") +
scale_colour_discrete(name = "Variable")
Here, we can see the categorisation and association In the NorthWest quadrant, it shows that those that pay for for extra classes in portugese and math, have family support and are mostly female students. They also get extra education support and have ambition for higher education. In addition, they mostly had nursery education too.
In the NorthEast quadrant, high alcohol use seem to be associated with guardian other than mother or father. Those with high alcohol use also seem to not have ambition for higher education. They also perform poorly in studies(low grade). and mostly choose school because it is closer to home amongst other reasons, other than reputation and course preference. They also do not seem to participate in extracurricular activities and are mostly in romantic relationship. They also frequently absent.
In the SouthWest quadrant, it can be seen that those that have low absence perform very well in studies do not take excessive alcohol. They are also engaged in extracurricular activities and have their mother as their guardian. They are also motivated to attend the school because of shool’s reputation.
In the SouthEast quadrant, Male students seem to be quite frequently absent and do not attend paid extra classes in portuguese and math. They mostly attend the school because of the course preference. They generally also have no internet access. They also mostly do not get school’s support. It also looks like their father are their guardian.
What I deduce here is that female gets more support because they are are with their mothers or parents generally whom are more educated and work in the health sector while male students seem to have the opposite situation. This support and ambition for higher education also motivate them to perform better in studies and be more frequently present.
This will be further explored with family of regression models (GLM, GAM, GBM)
GLM, GAM, GBM are quite similar but are respectively increasingly more detailed in delivering their predictions. They are all capable of handling not only gaussian distribution but also binomial and Poisson without having to transform the response variable via the identity function which uses logit and log for binomial and Poisson distributions respectively.
However, GAM has the advantage over GLM of being able to present some response curves from the models more readily. It offers more flexibility. In similar vein, GBM offers even more flexibility in dealing with the explanatory variables and their interactions. By recursive binary splitting also, it is able to produce model with improved predictive performance and less error. This is done with the number of internal iterations in it. It also presents the relative importance of the predictors.
Before I proceed, there is need to determine the error distribution of the response variables here. As in many realistic cases, we do not have normal distribution(i.e gaussian), we could also have poisson distribution for count data and last is the binomial distribution (e.g Yes/No, Presence/Absence). That said, one of the assumptions of poisson distribution is that variance is greater than mean. Therefore, I will be creating a function to test this assumption and decide the kind of distribution. You can see more here
Create function to test variance>mean assumption of poisson distribution
#Create function to test variance>mean assumption of poisson distribution
test.var.mn<- function(x){
if(var(x)>mean(x)){
print("VALID:The variance>mean assumption of poisson distribution is valid")
}
else{
print("INVALID:The variance>mean assumption of poisson distribution is invalid")
}
}
TEST THE ASSUMPTION OF POISSON DISTRIBUTION ON AGE AND ABSENCES
#see if Grade(G3) meets the assumption
test.var.mn(alco_data$G3)
## [1] "INVALID:The variance>mean assumption of poisson distribution is invalid"
#next, see if "absences" meets the assumption
test.var.mn(alco_data$absences)
## [1] "VALID:The variance>mean assumption of poisson distribution is valid"
It shows here that it is invalid for grade(G3) but valid for “absences” variable. Next, therefore, I will be exploring various regression models, with grade(G3) as gaussian, Absences as poisson and high alcohol use(High_use) as binomial.
As done earlier, the data will be copied into a new dataframe for all the regression analysis, to avoid confusion and alteration of the original data.
For GLM, predictors will be selected, by firstly using stepwise regression to eliminate the redundant variables. Afterwards, I will be employing the significance test from the model summary, anova test(Chi Square), and AIC values. I also checked if any of the predictor is curvillinear(i.e has higher order polynomial).
Anova Chi square test and the T.test from the model summary were used to further eliminate the insiginificant variables at level 0.05.
Here, we can see that the significant factors affecting students’ grades include, address, mother’s level of education, studytime, ambition to further higher education, and going out. Mother’s education and higher education ambition seem to be the most significatn factors. However, mother’s education level is not curvillear(i.e does not have higher order polynomial(i.e “I(Medu^2)”)
In accordance to the principle of parsimony, I decided to use variables with significance level of 0.05 Therefore, Final model: grade_glm<-glm(G3~ address + Medu + studytime + higher + goout,data=data_reg,family =“gaussian”)
let’s also see the if any observation has a leverage and to further test the normality assumption while I also explore the how the residuals differ from the fitted
par(mfrow=c(2,2))
plot(grade_glm, which = c(1,2,5))
From this, the constant variance of the error seem to be in line. The distribution also appears to be normal from the Normal Q-Q plot although, slight divergence from the line at the lower level. Also, no observation seem to have a clear leverage over others aside a point lying quite farther from others. Nevertheless, gaussian distribution can be safely utilised for modelling the grde(G3) response variable.
Mean Error and RMSE
Create Functions to calculate Mean Error and Root Mean Square Error(RMSE)
# function to calculate the mean absolute and RMSE
#function to calculate mean error
mean_error<- function(obs, pred){
me<-mean(abs(obs-pred))
return(me)
}
# Function that returns Root Mean Squared Error
rmse <- function(obs, pred){
rmse<-sqrt(mean((obs-pred)^2))
return(rmse)
}
EVALUATION AND TESTING OF MODELS
This part is for the model validation. The data was divided into 70% training/calibration data and 30% testing/evaluation data. It was also boostrapped 10 times. The results of the models and the response curves were thereafter presented. dividing into 70:30.
let’s see the correlation between the predicted and observed response variable
#correlation between the predicted and observed grade(G3)
all_cor_grade
## grade_glm grade_gam grade_gbm1 grade_gbm2
## 1 0.3097495 0.3113278 0.3186606 0.3234973
They all are low and not so much different.
Below, we can see the errors of the various models.
#error of all the models
all_error_grade
## grade_glm grade_gam grade_gbm1 grade_gbm2
## mean abs error 2.444750 2.454319 2.459633 2.441711
## RMSE 3.109096 3.124092 3.084825 3.073649
There appears to be not much difference in the errors of all the models used.
I chose GBM because it is able to handle multicollinearity and complex interactions, it can also show the response curves and relative importance of predictors. GAM is also able to show response curves and their confidence intervals.
SUMMARY OF THE GBM FOR GRADE(G3)
summary(grade_gbm1)
## var rel.inf
## higher higher 26.1473492
## Medu Medu 12.3553867
## alc_use alc_use 9.2532659
## absences absences 7.4461181
## age age 7.2333867
## traveltime traveltime 6.5834037
## studytime studytime 6.2798924
## goout goout 5.8174172
## Fedu Fedu 3.6308778
## address address 3.4899660
## freetime freetime 2.8089002
## sex sex 1.7590326
## internet internet 1.6940603
## famsup famsup 1.6558859
## romantic romantic 1.3589037
## famrel famrel 1.2023050
## activities activities 0.8399529
## Pstatus Pstatus 0.4438958
Here, we can see that higher education pursuit, mother’s educaton level, alcohol use, and absences seem to be most important factors affecting students performances. This is quite in line with the results I got from GLM and GAM, however, only mother’s education level and higher_use seem to be the significant factors. The least important factors are also shown in the GBM’s summary of relative importance.
now, let’s see the response curves of this from GAM
grade_gam <- gam(G3~ s(Medu, k=3) + higher + address + s(studytime, k=3) + higher + goout, data = data_reg,family = "gaussian")
plot(grade_gam, pages=1)
Here, from the smooth curve from GAM, we can see that mother’s education level has a positive effect on student’s performance. However, the confidence interval especially at the lower level. is quite large, which shows that there is a wide range of uncertainty. Perharps, more observations needed. Expectedly, study time also seem to have a positive effect on students’ performances but the uncertainty is quite high.
This will be explored further in the response curves from GBM below.
best.iter1<-gbm.perf(grade_gbm1, plot.it = F, method = "OOB")
par(mfrow=c(2,2))
plot.gbm(grade_gbm1, "Medu", best.iter1)
plot.gbm(grade_gbm1, "higher", best.iter1)
plot.gbm(grade_gbm1, "alc_use", best.iter1)
plot.gbm(grade_gbm1, "absences", best.iter1)
par(mfrow=c(1,1))
As we can see here, the higher the level of education of the mother, the more their kids seem to perform better. Also, students with intention to pursue higher education seem to perform better. Alcohol use and absences reduces performance and can result in dramatic reduction if it becomes chronic. This is moreso, with being absent from school.
plot(predict.gbm(grade_gbm1, data_reg, best.iter1), data_reg$G3,
main="Observed vs Predicted grade")
lines(lowess(predict.gbm(grade_gbm1, data_reg, best.iter1), data_reg$G3), col="red", lwd=3)
r_grade <-cor.test(predict.gbm(grade_gbm1, data_reg, best.iter1), data_reg$G3)
r2grade <- r_grade$estimate^2
r2grade
## cor
## 0.3242391
legend("topleft", paste("r^2=", round(r2grade,3)))
We can see from the scatterplots that the selected variables, account for only 32% factors affecting the students’ grades.
For GLM, the selection of predictors was done as earlier by using stepwise regression, anova(Chis sq) test and significance test from the model summary.
Final model
data_reg<- alco_data
abse_glm<-glm(absences ~ sex + age + Medu +
Pstatus + studytime + higher +
internet + goout + alc_use + G3
,data=data_reg,family ="poisson")
summary(abse_glm)
##
## Call:
## glm(formula = absences ~ sex + age + Medu + Pstatus + studytime +
## higher + internet + goout + alc_use + G3, family = "poisson",
## data = data_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.1134 -1.8059 -0.6326 0.8032 9.8400
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.170583 0.410599 0.415 0.67781
## sexM -0.406371 0.054637 -7.438 1.02e-13 ***
## age 0.095102 0.020967 4.536 5.74e-06 ***
## Medu 0.116579 0.024475 4.763 1.91e-06 ***
## PstatusT -0.458253 0.069217 -6.621 3.58e-11 ***
## studytime -0.156525 0.033700 -4.645 3.41e-06 ***
## higheryes -0.220171 0.108890 -2.022 0.04318 *
## internetyes 0.349942 0.078836 4.439 9.04e-06 ***
## goout 0.007283 0.023303 0.313 0.75463
## alc_use 0.220332 0.026020 8.468 < 2e-16 ***
## G3 -0.022122 0.008040 -2.751 0.00593 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1917.2 on 381 degrees of freedom
## Residual deviance: 1622.0 on 371 degrees of freedom
## AIC: 2663.6
##
## Number of Fisher Scoring iterations: 5
anova(abse_glm, test="Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: absences
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 381 1917.2
## sex 1 6.559 380 1910.7 0.010434 *
## age 1 39.964 379 1870.7 2.587e-10 ***
## Medu 1 28.611 378 1842.1 8.848e-08 ***
## Pstatus 1 37.104 377 1805.0 1.120e-09 ***
## studytime 1 50.651 376 1754.3 1.103e-12 ***
## higher 1 10.643 375 1743.7 0.001105 **
## internet 1 26.691 374 1717.0 2.387e-07 ***
## goout 1 15.196 373 1701.8 9.692e-05 ***
## alc_use 1 72.308 372 1629.5 < 2.2e-16 ***
## G3 1 7.512 371 1622.0 0.006130 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The factors that might be causing absence of students are above.
#correlation between the predicted and observed for all the models used.
all_cor_abse
## abse_glm abse_gam abse_gbm1 abse_gbm2
## 1 0.1011032 0.2145327 0.1081803 0.1123386
#The mean correlation
all_cor_abse
## abse_glm abse_gam abse_gbm1 abse_gbm2
## 1 0.1011032 0.2145327 0.1081803 0.1123386
#let's see the errors in the prediction for "absence" response variable.
all_error_abse
## abse_glm abse_gam abse_gbm1 abse_gbm2
## mean abs error 3.238315 3.000850 2.854459 2.778519
## RMSE 4.156815 3.843834 3.366397 3.362421
I chose GBM because it is able to handle multicollinearity and complex interactions, it can also show the response curves and relative importance of predictors.
Alcohol use, age, mother’s education, parents’ status, freetime and romantic relationship ostensibly, have the most effect on the cause of absences. Travel time and address seem to be the least
GAM
#now, let's see the response curves of this
#use all the dataset
abse_gam <- gam(absences~ sex + s(age, k=3) +s(Medu, k=3) +
Pstatus+ s(studytime, k=3) + higher +
internet + goout + s(alc_use, k=3) + s(G3, k=3), data = alco_data, family = "poisson")
plot(abse_gam, pages=1)
There appears not to be enough data for older students above 20, as the confidence interval’ seem high. Hence, difficult to assert the trend in this case. Also. the likelihood of absences reduces with increase in the level of education of the mother. Studytime also reduces this while Alcohol increases the frequency of absence. It is quite unsure how grade affects but very frequent absence seem to have a dwindling effect on students’ final grade.
Furthermore, I will be expploring this further with GBM reponse curves.
RESPONSE CURVES(GBM) FOR “ABSENCES”
best.iter1<-gbm.perf(abse_gbm1, plot.it = F, method = "OOB")
par(mfrow=c(2,3))
plot.gbm(abse_gbm1, "alc_use", best.iter1)
plot.gbm(abse_gbm1, "age", best.iter1)
plot.gbm(abse_gbm1, "Medu", best.iter1)
plot.gbm(abse_gbm1, "Pstatus", best.iter1)
plot.gbm(abse_gbm1, "freetime", best.iter1)
plot.gbm(abse_gbm1, "romantic", best.iter1)
par(mfrow=c(1,1))
Alcohol use, age mother’s education seem to increase the tendency to be absent while freetime does the opposite. This is quite surprising as, I had expected the exact opposite. Also, students’ with parents apart have more tendency to be more absent than those with their parents together. Likewise, students in romantic relationship are more likely to be more absent than those without.
plot(predict.gbm(abse_gbm1, data_reg, best.iter1), data_reg$absences,
main="Observed vs Predicted absences")
lines(lowess(predict.gbm(abse_gbm1, data_reg, best.iter1), data_reg$absences), col="red", lwd=3)
r_abse <-cor.test(predict.gbm(abse_gbm1, data_reg, best.iter1), data_reg$absences)
r2abse <- r_abse$estimate^2
r2abse
## cor
## 0.239431
legend("topleft", paste("r^2=", round(r2abse,3)))
We can see from the scatterplots that the selected variables, account for only 24%(the coefficient of determination) factors affecting the students’ absences.
Similar step was repeated here. However, in evaluating the models, I utilised Area Under Curve(AUC), odds ratio and confusion/classification matrix, instead.
Final model(GLM) for high alcohol use
halc_glm<-glm(high_use ~sex + studytime + goout + absences
,data=data_reg,family ="binomial")
anova(halc_glm, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: high_use
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 381 465.68
## sex 1 14.732 380 450.95 0.0001239 ***
## studytime 1 10.242 379 440.71 0.0013731 **
## goout 1 46.759 378 393.95 8.027e-12 ***
## absences 1 12.641 377 381.31 0.0003775 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# AUC values through the various iterations.
compared_model_halc
## halc_auc_glm halc_auc_gam halc_auc_gbm1 halc_auc_gbm2
## 1 0.8039542 0.8039542 0.7439024 0.7719882
## 2 0.8005213 0.8005213 0.8420209 0.8488372
## 3 0.8358824 0.8296078 0.7886275 0.8039216
## 4 0.8092997 0.8081702 0.8091114 0.8060994
## 5 0.8210326 0.8203396 0.7962578 0.7903673
## 6 0.7115964 0.7078313 0.7353163 0.7488705
## 7 0.7285714 0.7285714 0.6967857 0.6996429
## 8 0.7813725 0.7817647 0.8011765 0.8074510
## 9 0.7742652 0.7742652 0.7788790 0.7788790
## 10 0.7913150 0.7913150 0.7798875 0.7760197
#The mean AUC values for the various models
mean_auc_halc<-colMeans(compared_model_halc)
# attach(compared_model_halc)
#This was used temporarily to see if other models are significantly better.
# wilcox.test(halc_auc_gbm1, halc_auc_gam, paird=T)
#let's see the mean auc values for all the models
mean_auc_halc
## halc_auc_glm halc_auc_gam halc_auc_gbm1 halc_auc_gbm2
## 0.7857811 0.7846341 0.7771965 0.7832077
ROC(AUC) curves of the various models.
#show the AUC curves from the last iteration on the test data
par(mfrow=c(2,2))
colAUC(halc_glm_pred, eva$high_use , plotROC = T)
## [,1]
## FALSE vs. TRUE 0.791315
colAUC(pred_halc_gam, eva$high_use , plotROC = T)
## [,1]
## FALSE vs. TRUE 0.791315
colAUC(pred_halc_gbm1, eva$high_use , plotROC = T)
## [,1]
## FALSE vs. TRUE 0.7798875
colAUC(pred_halc_gbm2, eva$high_use , plotROC = T)
## [,1]
## FALSE vs. TRUE 0.7760197
par(mfrow=c(1,1))
Here, I utilised Area Under Curve for evaluating my model. This is because it prevents subjectiveness in selecting thresholds which can significantly affect the predictive performance and commission and omission error. It does this by considering all possible threshold and utilising that which minimises the error.
Although, measures, such as prevalence have been recommended for dealing with selection of threshold to conver into binary(e.g, True or False). AUC readily takes care of this by considering all the possible thresholds and evaluates the performance, accordingly. a value of >0.9 is an excellent model while AUC values with range 0.7-0.8, 0.5-0.7, 0.5 are fairly good, poor and very poor respectively.
Here, my AUC values for all the models thorugh my boostrapping(repititive sampling) and resampling are about average of 0.7 for the three different models(GLM, GAM, GBM) which I applied. Without boostrapping, There all are about 0.81.
####################################################
#Model for alcohol use
#RESPONSE CURVES AND MODEL INTERPRETATIONS.
#Here, i decided to use the entire data for the analysis
#GAM
halc_gam<-gam(high_use~ sex+ s(studytime, k=3) + s(goout, k=3) +
s(absences, k=3) , data =data_reg, family = "binomial")
summary(halc_gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## high_use ~ sex + s(studytime, k = 3) + s(goout, k = 3) + s(absences,
## k = 3)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.4490 0.1983 -7.308 2.72e-13 ***
## sexM 0.7817 0.2656 2.944 0.00324 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(studytime) 1 1 6.095 0.0136 *
## s(goout) 1 1 36.715 1.37e-09 ***
## s(absences) 1 1 12.118 0.0005 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.227 Deviance explained = 18.1%
## UBRE = 0.024361 Scale est. = 1 n = 382
#plot the response curves from GAM
plot(halc_gam, pages=1)
From the above, we can see the response of high alcohol use to the various predictors and also the confidence interval. There appears to be lesser tendency of high alcohol use when study time increases. This is expected, as students have lesser time for social activities and parties.
On the other hand, expectedly, going out increases the tendency of high alcohol use. In similar vein, absence from school increases the tendency too. but as it can be seen from the plot, the confidence interval seems to reduce with increase in number of absences, which might be an indication of insufficient data from that region.
To get a deeper, insight, I will explore this further with GBM
#GBM
halc_gbm1<-gbm(formula = high_use~ sex + age+address+Medu+Fedu+
Pstatus+ traveltime+studytime+famsup+activities+higher+
internet+famrel+romantic+freetime+goout+ absences, data=data_reg,
distribution = "bernoulli",n.trees = 2800, shrinkage = 0.001, interaction.depth = 6,
bag.fraction = 0.75)
summary(halc_gbm1)
## var rel.inf
## goout goout 29.72788976
## absences absences 18.40705649
## sex sex 13.49867336
## famrel famrel 7.74887748
## traveltime traveltime 6.69447510
## Medu Medu 4.08510861
## studytime studytime 3.63939972
## freetime freetime 3.45167845
## age age 2.68822384
## activities activities 2.53328558
## Fedu Fedu 2.40979670
## address address 2.00560011
## famsup famsup 1.35906028
## romantic romantic 1.10107217
## internet internet 0.42585530
## Pstatus Pstatus 0.12776898
## higher higher 0.09617807
From the relative importance, we can see that goout, absences, sex and family relationship seem to have the highest effect on high alcohol use. The least important factors can also be seen from the model summary.
Now, i’ll show the response curve from GBM to see the effects
#Now, i'll show the response curve from GBM to see the effects
#plot(halc_gbm1)
best.iter1<-gbm.perf(halc_gbm1, plot.it = F, method = "OOB")
# pred_halc_gbm1<-predict.gbm(halc_gbm1,newdata = eva, best.iter1, type = "response")
par(mfrow=c(2,2))
plot.gbm(halc_gbm1, "goout", best.iter1)
plot.gbm(halc_gbm1, "absences", best.iter1)
plot.gbm(halc_gbm1, "sex", best.iter1)
plot.gbm(halc_gbm1, "famrel", best.iter1)
par(mfrow=c(1,1))
plot.gbm(halc_gbm1, "studytime", best.iter1)
As it can also be seen from the GBM reponse curves, absences and going out, increases the tendency of high alcohol use. Going out steeply affects when it is more than average. Absences of slighty higher than 10 times can even increase the tendency of high alcohol consumption. Family relationship however, reduces the tendency. and overall, male students seem to have relatively more high alcohol use than their female counterpart. Also, as shown earlier, more study time appears to reduce the tendency of high alcohol use.
#explore the odd's ratio
halc_glm_mod <- glm(high_use~ sex + studytime + goout + absences, data=data_reg, family = "binomial")
odds_ra <- exp(coef(halc_glm_mod))
#odds_ra <- coef(high_use_mod2) %>% exp #alternaive
# compute confidence intervals (conf_int)
conf_int <- exp(confint(halc_glm_mod))
#Conf_Int <- high_use_mod2 %>% confint() %>% exp #alternative
# print out the odds ratios with their confidence intervals
cbind(odds_ra, conf_int)
## odds_ra 2.5 % 97.5 %
## (Intercept) 0.04056573 0.01222314 0.1263579
## sexM 2.18525582 1.30370562 3.7010763
## studytime 0.65628388 0.46510383 0.9099505
## goout 2.06838526 1.64470314 2.6349716
## absences 1.08123884 1.03577383 1.1324143
Here, we can also see that the odd’s ratio of Male students, absences and going out are more than 1, which indicates success: that they all increase the tendency of high alcohol consumption by students. This is not surprising. Absence from school and going out would normally be expected to result in high alcohol use. The confidence interval shows that it is highly likely that these factors affect. On the other hand, study time shows failure with high alcohol consumption which is also not surprising, because when studying more, one would expect that students have less time for going out and getting drunk. These results are all in accordance with the results of the models used earlier.
ERROR IN PREDICTD HIGH ALCOHOL USE. Although, AUC is preferred, as it considers all possible threshold and prevents subjectiveness, I will be exploring a confusion matrix too and 0.5 will be made the threshold for TRUE or FALSE for high alcohol use. More information on creating functions to calculate these metrics in a more automated way can be found here.
#fit the model
# predict() the probability of high_use
probs<- predict(halc_glm_mod, type = "response")
#Copy the data again
alc<-data_reg
#add the predicted probabilities to âalca
alc$prob_high_use <- probs
#alc <- mutate(alc, probability = probabilities)
#use the probabilities to make a prediction of high_use, setting 0.5 as threshold
alc$predict_high_use<- (alc$prob_high_use)>0.5
#alc <- mutate(alc, prediction = prob_high_use>0.5)
#see the first ten and last ten original classes, predicted probabilities, and class predictions
head(alc[,c("failures", "absences", "sex", "high_use", "prob_high_use", "predict_high_use")], n=10)
## failures absences sex high_use prob_high_use predict_high_use
## 1 0 3 F FALSE 0.03910480 FALSE
## 2 1 2 F TRUE 0.27212509 FALSE
## 3 0 8 F FALSE 0.09326837 FALSE
## 4 0 2 F FALSE 0.05424023 FALSE
## 5 1 5 F TRUE 0.03386206 FALSE
## 6 0 2 F FALSE 0.05424023 FALSE
## 7 1 0 F FALSE 0.04676258 FALSE
## 8 0 1 F FALSE 0.14322690 FALSE
## 9 0 9 F TRUE 0.06105537 FALSE
## 10 0 10 F FALSE 0.12696107 FALSE
#tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$predict_high_use)
## prediction
## high_use FALSE TRUE
## FALSE 252 16
## TRUE 65 49
#Here, I derived the various classification matrix's metrics
#Accuracy: Overall, how often is the classifier correct?
#(TP+TN)/total
accuracy<- (252 + 49)/nrow(alc)
print(paste("The accuracy of the classification is", accuracy))
## [1] "The accuracy of the classification is 0.787958115183246"
#Error rate
print(paste("The error rate/misclassification is", 1-accuracy))
## [1] "The error rate/misclassification is 0.212041884816754"
#True Positive Rate:
#"Sensitivity" or "Recall": When actually yes, how often does it predict yes?
#(TP/actual yes)
print(paste("The True positive rate/Sensitivity is", 49/(65+49)))
## [1] "The True positive rate/Sensitivity is 0.429824561403509"
#False Positive Rate: i.e When actually no, how often does it predict yes?
print(paste("The false positive rate is", 16/(16+252)))
## [1] "The false positive rate is 0.0597014925373134"
#Specificity: When actually no, how often does it predict no?
#TN/actual no
#Alternatively: 1 minus False Positive Rate
print(paste("The Specificity is", 252/(252+16)))
## [1] "The Specificity is 0.940298507462687"
#Precision: When it predicts yes, how often is it correct?
#TP/predicted yes
print(paste("The precision is", 49/(16+49)))
## [1] "The precision is 0.753846153846154"
#Prevalence: How often does the yes condition actually occur in our sample?
#actual yes/total
print(paste("The prevalence is", (65+49)/nrow(alc)))
## [1] "The prevalence is 0.298429319371728"
Accuracy of the prediction is about 0.79 which is fairly good. Miclassification/error rate is about 0.21 By using threshold of 0.5, we can see that 252 FALSE high alcohol use were Truly/rightly predicted and 16 falsely predicted. There are also 49 high alcohol use that were rightly predicted to be TRUE while 65 TRUE, wrongly predicted as FALSE. The prediction seems to be better when predicting no high alcohol use(i.e Specifity) than when predicting the true value(sensitivity. This can also be seen in the plot below.
Plotting the error of prediction.
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = prob_high_use, y = high_use, col= predict_high_use))
# define the geom as points and draw the plot
g + geom_point()
ALTERNATIVE FOR CACLULATING THE ERROR
#Proportion of errors
conf_mat<-table(high_use = alc$high_use, prediction = alc$predict_high_use)
conf_mat<-prop.table(conf_mat)
addmargins(conf_mat)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65968586 0.04188482 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.82984293 0.17015707 1.00000000
#mean error of the prediction
mean(abs(alc$high_use-alc$prob_high_use)>0.5)
## [1] 0.2120419
#The below is an alternative way by firstly defining the function to calculate the mean error.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$prob_high_use)
## [1] 0.2120419
K-fold cross-validation
#K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = halc_glm_mod, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2146597
Here I used an arbitrary threshold of 0.5 but it has been suggested that prevalence can be used in selection of threshold. you can see more here about terminologies related to confusion/classification matrix.
###############################################################
#LINEAR DISCRIMINANT ANALYSIS.
#copy data again
data_copy2<- alco_data
#filter the numeric variables.
data_num<-Filter(is.numeric, data_copy2)
#Scale the numeric variables
data_num_sca<- data.frame( scale(data_num))
#get the categorised grades created earlier into the vector
G3_cat<-data_mca_fac2[,c("G3")]
#add to to the scaled data frame
data_num_G3<- cbind(data_num_sca, G3_cat)
#summary
summary(data_num_G3)
## age Medu Fedu traveltime
## Min. :-1.3519 Min. :-2.5831 Min. :-2.3402 Min. :-0.6434
## 1st Qu.:-0.4997 1st Qu.:-0.7422 1st Qu.:-0.5158 1st Qu.:-0.6434
## Median : 0.3525 Median : 0.1783 Median : 0.3964 Median :-0.6434
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3525 3rd Qu.: 1.0988 3rd Qu.: 1.3086 3rd Qu.: 0.7939
## Max. : 4.6133 Max. : 1.0988 Max. : 1.3086 Max. : 3.6683
## studytime failures famrel freetime
## Min. :-1.23721 Min. :-0.3458 Min. :-3.24319 Min. :-2.2541
## 1st Qu.:-1.23721 1st Qu.:-0.3458 1st Qu.: 0.06937 1st Qu.:-0.2233
## Median :-0.04374 Median :-0.3458 Median : 0.06937 Median :-0.2233
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:-0.04374 3rd Qu.:-0.3458 3rd Qu.: 1.17356 3rd Qu.: 0.7921
## Max. : 2.34320 Max. : 4.8004 Max. : 1.17356 Max. : 1.8075
## goout Dalc Walc health
## Min. :-1.8897 Min. :-0.5434 Min. :-1.0162 Min. :-1.8397
## 1st Qu.:-0.9952 1st Qu.:-0.5434 1st Qu.:-1.0162 1st Qu.:-0.4099
## Median :-0.1007 Median :-0.5434 Median :-0.2320 Median : 0.3051
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7938 3rd Qu.: 0.5847 3rd Qu.: 0.5522 3rd Qu.: 1.0200
## Max. : 1.6883 Max. : 3.9691 Max. : 2.1206 Max. : 1.0200
## absences G1 G2 G3
## Min. :-0.8238 Min. :-3.5147 Min. :-2.6409 Min. :-3.4614
## 1st Qu.:-0.6407 1st Qu.:-0.5517 1st Qu.:-0.5185 1st Qu.:-0.4405
## Median :-0.2746 Median : 0.1891 Median : 0.1889 Median : 0.1637
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2746 3rd Qu.: 0.9298 3rd Qu.: 0.8963 3rd Qu.: 0.7679
## Max. : 7.4139 Max. : 2.4113 Max. : 2.3112 Max. : 1.9763
## alc_use G3_cat
## Min. :-0.9024 vlow :143
## 1st Qu.:-0.9024 low :107
## Median :-0.3947 high : 64
## Mean : 0.0000 vhigh: 68
## 3rd Qu.: 0.6207
## Max. : 3.1592
DIVIDE DATA INTO TEST AND TRAIN DATA FOR VALIDATION.
#now, divide into 80:20 train:test data
samp <- sample(1:nrow(data_num_G3), size = nrow(data_num_G3)*0.8)
train <- data_num_G3[samp,]
test <- data_num_G3[-samp,]
#Dimension.
dim(train)
## [1] 305 18
#remove the grades columns
train$G1<-train$G2<-train$G3<-NULL
#check the column names now
colnames(train)
## [1] "age" "Medu" "Fedu" "traveltime" "studytime"
## [6] "failures" "famrel" "freetime" "goout" "Dalc"
## [11] "Walc" "health" "absences" "alc_use" "G3_cat"
fitting the categorised grade to all other variables
#fit the categorised grade to all other variables
lda.fit <- lda(G3_cat~., data = train)
#lda.fit
#create arrow function
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#convert the categorised grade into numeric for the LDA plot
train$G3_cat <- as.numeric(train$G3_cat)
#plot
plot(lda.fit, dimen = 2, col = train$G3, pch= train$G3)
lda.arrows(lda.fit, myscale = 2)
failures in the past is contributing most to poor grades. Absences also appear to be. Mother’s education level seems to be a factor contributing to students’ success.
LDA CLASSIFICATION ACCURACY
#see the class prediction accuracy
G3_cat<-test$G3_cat
test<-dplyr::select(test, -G3_cat)
lda.pred<-predict(lda.fit, newdata = test)
table(correct = G3_cat, predicted = lda.pred$class)
## predicted
## correct vlow low high vhigh
## vlow 13 11 3 3
## low 9 8 3 1
## high 0 2 4 1
## vhigh 3 7 3 6
The table above shows the right classification and misclassification.
Distance between variables.
#Now, see the distance between variables.
dist_sca<-dist(data_num_sca)
summary(dist_sca)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5492 4.4994 5.4263 5.6181 6.5457 13.4290
Next, cluster the data better using k-means clustering and refit the LDA using the clustered data.
First, determine the optimal number of clusters using the elbow method of TWCSS.
Distance between variables.
set.seed(123) #This is used to make the values constant when re-run
# determine the number of clusters
k_max <- 20
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(data_num_sca, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = c('point', 'line'))
############################################################
#some other methods of exploring optimal number of classes
# library(cluster)
# library(factoextra)
# # reVisualize k-means clusters
# km.res <- kmeans(data_num_sca, 3, nstart = 25)
# fviz_cluster(km.res, data = data_num_sca, geom = "point",
# stand = FALSE, frame.type = "norm")
#
# #install.packages("factoextra")
# require(cluster)
# fviz_nbclust(data_num_sca, kmeans, method = "silhouette")
#############################################################
Further, use the NbClust to confirm. There are quite many others too. Two others are in the previous code chunk where the packages “cluster” and “factoextra” can be used. See more here for other cluster methods of checking the optimal number of classes.
#install.packages("cluster")
library(NbClust)
nb <- NbClust(data_num_sca, diss=NULL, distance = "euclidean",
min.nc=2, max.nc=5, method = "kmeans",
index = "all", alphaBeale = 0.1)
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 8 proposed 2 as the best number of clusters
## * 13 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
#Based on these, I will make my number of clusters, 3
# k-means clustering
#use the euclidean distance calculated earlier.
km <-kmeans(dist_sca, centers = 3)
#perform LDA on the clustered data
lda.fit_clus<- lda(km$cluster~., data=data_num_sca)
par(mfrow=c(1,1))
plot(lda.fit_clus, dimen = 2, col = km$cluster, pch= km$cluster)
lda.arrows(lda.fit_clus, myscale = 2)
Seems past failure and alcohol use also causes poor performance
#copydata again
data_copy2<-alco_data
data_num2<-Filter(is.numeric, data_copy2)
data_LDA1_sca<- data.frame( scale(data_num2))
summary(data_LDA1_sca)
## age Medu Fedu traveltime
## Min. :-1.3519 Min. :-2.5831 Min. :-2.3402 Min. :-0.6434
## 1st Qu.:-0.4997 1st Qu.:-0.7422 1st Qu.:-0.5158 1st Qu.:-0.6434
## Median : 0.3525 Median : 0.1783 Median : 0.3964 Median :-0.6434
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3525 3rd Qu.: 1.0988 3rd Qu.: 1.3086 3rd Qu.: 0.7939
## Max. : 4.6133 Max. : 1.0988 Max. : 1.3086 Max. : 3.6683
## studytime failures famrel freetime
## Min. :-1.23721 Min. :-0.3458 Min. :-3.24319 Min. :-2.2541
## 1st Qu.:-1.23721 1st Qu.:-0.3458 1st Qu.: 0.06937 1st Qu.:-0.2233
## Median :-0.04374 Median :-0.3458 Median : 0.06937 Median :-0.2233
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:-0.04374 3rd Qu.:-0.3458 3rd Qu.: 1.17356 3rd Qu.: 0.7921
## Max. : 2.34320 Max. : 4.8004 Max. : 1.17356 Max. : 1.8075
## goout Dalc Walc health
## Min. :-1.8897 Min. :-0.5434 Min. :-1.0162 Min. :-1.8397
## 1st Qu.:-0.9952 1st Qu.:-0.5434 1st Qu.:-1.0162 1st Qu.:-0.4099
## Median :-0.1007 Median :-0.5434 Median :-0.2320 Median : 0.3051
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7938 3rd Qu.: 0.5847 3rd Qu.: 0.5522 3rd Qu.: 1.0200
## Max. : 1.6883 Max. : 3.9691 Max. : 2.1206 Max. : 1.0200
## absences G1 G2 G3
## Min. :-0.8238 Min. :-3.5147 Min. :-2.6409 Min. :-3.4614
## 1st Qu.:-0.6407 1st Qu.:-0.5517 1st Qu.:-0.5185 1st Qu.:-0.4405
## Median :-0.2746 Median : 0.1891 Median : 0.1889 Median : 0.1637
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2746 3rd Qu.: 0.9298 3rd Qu.: 0.8963 3rd Qu.: 0.7679
## Max. : 7.4139 Max. : 2.4113 Max. : 2.3112 Max. : 1.9763
## alc_use
## Min. :-0.9024
## 1st Qu.:-0.9024
## Median :-0.3947
## Mean : 0.0000
## 3rd Qu.: 0.6207
## Max. : 3.1592
#removwe the alcohol columns
data_LDA1_sca$alc_use<-data_LDA1_sca$Walc<-data_LDA1_sca$Dalc<-NULL
#see colmn names
colnames(data_LDA1_sca)
## [1] "age" "Medu" "Fedu" "traveltime" "studytime"
## [6] "failures" "famrel" "freetime" "goout" "health"
## [11] "absences" "G1" "G2" "G3"
#Get the alcohol use into a cector
alc_use<-alco_data[,c("alc_use")]
#combine with the scaled data
data_LDA2<- cbind(data_LDA1_sca, alc_use)
#summary(data_LDA2)
lda.fit2 <- lda(alc_use~., data = data_LDA2)
#lda.fit
#first, determine the optimal number of clusters using twcss.
set.seed(123)
# determine the number of clusters
k_max <- 10
alc_sca<-as.data.frame(scale(data_LDA2))
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(alc_sca, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
dist_alc<- dist(alc_sca)
km <-kmeans(dist_sca, centers = 4)
#perform LDA on the clustered data
lda.fit_clus<- lda(km$cluster~., data=alc_sca)
lda.fit_clus
## Call:
## lda(km$cluster ~ ., data = alc_sca)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.24345550 0.23036649 0.09162304 0.43455497
##
## Group means:
## age Medu Fedu traveltime studytime failures
## 1 -0.15150389 0.47524588 0.31793533 -0.2879191 0.34125036 -0.3273272
## 2 0.09100733 -0.26100930 -0.13226160 0.2385657 -0.27429626 0.2974988
## 3 0.64464336 -0.40027789 -0.56793020 0.9170400 -0.58932564 1.5656617
## 4 -0.09928495 -0.04348989 0.01173851 -0.1585163 0.07848304 -0.3044375
## famrel freetime goout health absences G1
## 1 0.06937305 -0.037671026 -0.38924340 -0.20230345 -0.2411265 1.2245145
## 2 -0.33214977 0.007491072 0.22458595 0.11007841 0.2933109 -0.6316473
## 3 -0.11991628 0.327936363 0.56380349 0.08036769 0.8028458 -0.8691426
## 4 0.16249732 -0.052009528 -0.01986174 0.03803886 -0.1896759 -0.1679210
## G2 G3 alc_use
## 1 1.2196444 1.15119286 -0.4930008
## 2 -0.6511951 -0.66705988 0.4591347
## 3 -0.9430175 -0.98425931 1.4475137
## 4 -0.1392539 -0.08379874 -0.2723962
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## age -0.07064798 -0.10742109 -0.154803905
## Medu -0.21614373 -0.28946292 -0.292725530
## Fedu -0.04019076 0.11655127 0.358301628
## traveltime 0.30661476 -0.31082720 -0.012004516
## studytime -0.03182434 -0.07466985 -0.124071455
## failures 0.38617707 -0.87803992 -0.248719853
## famrel -0.08948820 -0.04204097 -0.813033662
## freetime -0.03045788 -0.13204798 -0.033243313
## goout 0.08390191 0.25133500 0.003604871
## health -0.02027092 0.12273177 0.098613125
## absences 0.37848441 -0.27521538 0.221078909
## G1 -0.36587629 -0.65489022 0.378096333
## G2 -0.25588535 -0.69491291 0.318987655
## G3 -0.63484376 0.34177356 -0.616701007
## alc_use 0.85244671 -0.37363870 -0.003084369
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.8474 0.1444 0.0083
#plot
plot(lda.fit_clus, dimen = 2, col = km$cluster, pch= km$cluster)
lda.arrows(lda.fit_clus, myscale = 2)
Here, we can see that going out, free time and absences tend tend to cause high alcohol use. On the other hand, study time, family relationship tend to reduce it.
3D VISUALISATION OF THE CLUSTERS
model_predictors <- dplyr::select(data_LDA2, -alc_use)
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#Next, install and access the plotly package. Create a 3D plot (Cool!) of the columns of the matrix product by typing the code below.
library(plotly)
#using the plotly package for 3D plotting of the matrix #products columns.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=data_LDA2$alc_use)
Overall, the various analyses have shown the variation amongst the students. Female students seem to get more family and school supports than their male counterpart and are also able to attend paid extra classes more than the latter.
Generally, male students consume more alcohol highly and high alcohol consumption has been shown to have a negative effect on grade. Furthermore, going out has been linked to high alcohol consumption which is more common with students that are frequently absent.
On a good note, Mother’s level of education plays a very vital role in students perfomance. In similar manner, ambition for higher education was found to be a considerable factor in the academic performances of the students and are more common amongst female students than male students.
On this note, I will be rejecting all my null hypotheses, as high alcohol consumption has a significant effect on students’ performances. Students’ grades are also affected by absence.
It is also noteworthy that more samples should be collected and also more variables, as most of the variables explained less than 50% of all my target/response variables : grades, high alcohol consumption and absences.
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.